Bulk RNA-seq Manual v1.2

Youkyung Lim

2025-03-19

Introduction

nf-core RNA-seq v.3.12.0 STAR-Salmon :
This pipeline performs read alignment and transcript quantification of bulk RNA-seq data. Based on nf-core RNA-seq v.3.12.0 STAR-Salmon, a bioinformatics community project for making computational methods portable and reproducible, it runs with Nextflow workflow manager, and has been optimized for differential expression analysis at the gene level. To demonstrate this pipeline, bulk mRNA-seq dataset with biological replicates and technical replicates from a publication is used (cited below). Human genome reference version GRCh38 primary assembly and annotation version Gencode v44 Basic are used. For more information on Gencode Basic: click here

Input files :
* example_DesignFile.csv : Sample names, (relative or absolute) paths to data files, and NGS library strandedness according to the nf-core guideline.
* example_nf-params.json : Pre-defined nf-core parameters for RNA-seq_v1.1 pipeline. It includes paths to the reference genome (.fasta) and annotation (.gtf) files.
* example_run_nextflow.sh : Bash script file for initiating a Nextflow run (a.k.a your magic button). You can adjust the ‘wdir’, ‘proj’ and ‘run’ parameters before initiating the run.

Required changes on example_run_nextflow.sh before initiating your run :
Change the three parameters and save the file. Use a descriptive run name.
* wdir = [working_directory]
* proj = [project_title] e.g. “rnaseq”
* run = [run_name] e.g. “star_Salmon_basic_ylim”
Organize your project as [working_directory]/[project_title]/[run_name]/ so you can save the results from multiple runs under a single project folder. Now this is where your “work” and “result” folders will be created from the RNA-seq_v1.1 run.

Before running the pipeline :
* Check your raw data size & estimate your project folder size : $ du -hsc /path/to/raw/.fastq.gz
Check storage space in the server : $ df -H /home/lety/data/
* Coordinate with other users for optimal memory usage : $ htop

Output files :
* Quality control (.html) : /result/multiqc/star_salmon/multiqc_report.html
* Alignment files (.markdup.sorted.bam) : /result/star_salmon/[sample_name].markdup.sorted.bam
* Count files merged for all samples (.tsv) : /result/star_salmon/salmon.merged.gene_counts.tsv
* Count files per sample (.sf) : /result/star_salmon/[sample_name]/quant.genes.sf

BAM file validation :
We found truncated BAM files even with successful completion message from the Nextflow run. It is because the server storage got full at the time of writing the files. Such error is not detectable by the original pipeline, hence additional checks on all BAM files are now added in example_run_nextflow.sh. Following files can be found in /result/star_salmon/picard_validatesamfile.
* bad_bams.fofn : quick check whether the BAM files have EOF (end-of-file) marks.
* bam_summary.txt : full output of Picard ValidateSamFile
* error_messages.txt : file name and detected error and warning messages extracted from bam_summary.txt. Please check the GATK troubleshooting page for what they mean.

After running the pipeline :
* Please check execution report, multiqc report, bam file validation result to assess successful completion of all the stages.
* Please remove scratch/ and work/ directories in the server.

File tree :
├── [working_directory]/
│…. ├── example_DesignFile.csv
│…. ├── example_nf-params.json
│…. ├── example_run_nextflow.sh
│…. ├── CBBI_RNAseq_manual.rmd
│…. ├── [project_title]/
│…. │…. ├── [run_name]/
│…. │…. │…. ├── result/
│…. │…. │…. │…. ├── pipeline_info/
│…. │…. │…. │…. │…. ├── execution_report_yyyy-mm-dd_hh-mm-ss.html
│…. │…. │…. │…. ├── multiqc/
│…. │…. │…. │…. │…. ├── star_salmon/
│…. │…. │…. │…. │…. │…. ├── multiqc_report.html
│…. │…. │…. │…. ├── star_salmon/
│…. │…. │…. │…. │…. ├── [sample_name].markdup.sorted.bam
│…. │…. │…. │…. │…. ├── [sample_name].Aligned.out.bam
│…. │…. │…. │…. │…. ├── salmon.merged.gene_counts.tsv
│…. │…. │…. │…. │…. ├── salmon.merged.gene_tpm.tsv
│…. │…. │…. │…. │…. ├── [sample_name]/
│…. │…. │…. │…. │…. │…. ├── quant.genes.sf
│…. │…. │…. │…. │…. │…. ├── quant.sf
│…. │…. │…. │…. │…. ├── picard_validatesamfile/
│…. │…. │…. │…. │…. │…. ├── error_messages.txt
│…. │…. │…. ├── work/
│…. │…. │…. ├── scratch/
│…. │…. │…. ├── DEAoutput/
│…. │…. │…. │…. ├── Counts.csv
│…. │…. │…. │…. ├── TPM.csv
│…. │…. │…. │…. ├── plot_*.pdf
│…. │…. │…. │…. ├── DEA_all_genes.csv
│…. │…. │…. │…. ├── DEA_significant_genes.csv
│…. │…. │…. │…. ├── DEA_all_genes_IPA.txt
│…. │…. │…. │…. ├── fgsea_result.txt
│…. │…. │…. │…. ├── fgsea_result_main.txt

Set up for DEA

This R Markdown documents (.RMD) file reads in the output files of the RNA-seq_v1.1 pipeline and detects differentially expressed genes using DESeq2 package. It is assumed that this file is in the same [working_directory]. Any figures and gene lists from the analysis would be saved under [working_directory]/[project_title]/[run_name]/DEAoutput.

seed <- 20230816

## install pacman library if not installed
if ( !require("pacman", character.only = TRUE)) {install.packages("pacman", dependencies = TRUE, quiet = TRUE) }

## load and install libraries using pacman
pacman::p_load("dplyr", "ggplot2", "ggfortify", "DESeq2", "IHW", "apeglm", "ashr", "xlsx", "pheatmap", "RColorBrewer", "gridExtra", "ggvenn", "dendextend", "seriation", "Rtsne", "tximport", "tidyr", "plyr", "ggvenn", "ggrepel", "plotly", "ggplotify", "goseq", "stringr", "msigdbr", "clusterProfiler", "fgsea", "data.table", "corrplot", "GenomicTools.fileHandler", "AnnotationDbi", "org.Hs.eg.db", "DT")

'%notin%'<- Negate('%in%')

## aesthetics
color <- list(Condition = c("cornflowerblue","orange2"), 
              Replicate = c("royalblue4","mediumseagreen","mediumorchid3"), 
              Extra = c("darkgreen", "firebrick3", "darkcyan", "deeppink2"))

Theme.YL <- ggplot2::theme(text = element_text(size = 14),  # family = "Sans", 
                  plot.title = element_text(size = rel(1.3), hjust=0.5),
                  plot.subtitle = element_text(size = rel(1), hjust=0.5),
                  # axis.title.x = element_text(size = rel(1)),
                  # axis.title.y = element_text(size = rel(1)),
                  legend.text = element_text(size = rel(0.8)),
                  panel.background = element_rect(fill = 'white', colour = NA),
                  panel.border = element_rect(fill = NA, colour = 'grey20'),
                  strip.background = element_rect(colour = 'grey20', fill = 'white'),
                  panel.grid.major = element_line(colour = 'grey70', linetype = 'dotted'),
                  plot.margin = unit(c(2,2,2,2), "cm"))

Create SampleSheet

SampleSheet contains sample name, various file paths, condition (i.e. CC_CR and CC), biological replicates (i.e. rep1, rep2, and rep3), and technical replicates (i.e. s1, s2, s3, and s4).

wdir <- here::here() # working directory
if (!dir.exists(wdir)) {
    stop("Directory does not exist: ", wdir)
}
proj <- "rnaseq"  # your project title 
runname <- "star_Salmon_basic_ylim"  # descriptive run name 
outdir <- paste0(wdir, "/", "DEAoutput/")  #"DEAoutput/" # where DEA results will be saved

## create output directory if it does not exist yet
ifelse(!dir.exists(file.path(outdir)), dir.create(file.path(outdir)), FALSE)
## [1] FALSE
## create SampleSheet from the design file
SampleSheet <- read.csv2(paste0(wdir, "/", "example_DesignFile.csv"), sep = ",")

## condition
SampleSheet$condition <- unlist(lapply(strsplit(SampleSheet$sample, "_"), 
                                       function(x) paste(x[1:(length(x)-2)], collapse = "_")))
SampleSheet <- SampleSheet %>% mutate(condition = as.factor(condition))

## biological replicate
SampleSheet$bio_rep <- unlist(lapply(strsplit(SampleSheet$sample, "_"), 
                                     function(x) ifelse(length(x)==4, paste(x[1:(length(x)-1)], collapse = "_"), paste(x[1:(length(x)-1)], collapse = "_"))))

## technical replicate
SampleSheet$tech_rep <- unlist(lapply(strsplit(SampleSheet$sample, "_"), 
                                      function(x) ifelse(length(x)==4, paste(x[3:(length(x)-1)], collapse = "_"), paste(x[2:(length(x)-1)], collapse = "_"))))

## path to the count files
SampleSheet$Salmon_basic_gene_counts <- unlist(lapply(SampleSheet$sample, 
                                                      function(x) paste0(wdir, "/", proj, "/", runname, "/result/star_salmon/",x,"/quant.genes.sf")))
SampleSheet$Salmon_basic_transcript_counts <- unlist(lapply(SampleSheet$sample, 
                                                            function(x) paste0(wdir, "/", proj, "/", runname, "/result/star_salmon/",x,"/quant.sf")))

## SampleSheet
# DT::datatable(SampleSheet,
#   caption = htmltools::tags$caption(style = 'caption-side: top; text-align: left;', htmltools::em('SampleSheet')))

Load in count matrix

Import the output of Salmon (i.e. quant.genes.sf), which includes relative abundance (=TPM, Transcripts Per Million), estimated counts, and effective length.

## Import the count files using tximport. tximport will automatically group the data on predefined groups. Filepaths and SampleTitle must be in the same order!
GetCounts_mRNA <- function(FilePaths, SampleTitle, QuantType) {
  genIdCol <- ifelse(QuantType == "rsem", "gene_id", "Name")
  Temp.Files <- FilePaths
  names(Temp.Files) <- SampleTitle
  tximport_Counts <- tximport::tximport(Temp.Files, type=QuantType, txIn = F, txOut = F, geneIdCol = genIdCol)
  tximport_Counts[["length"]][which(tximport_Counts[["length"]] == 0)] <- 1
  return(tximport_Counts)
}

## gene-level.
Salmon.basic.Counts <- GetCounts_mRNA(SampleSheet$Salmon_basic_gene_counts, SampleSheet$sample, "salmon")

## transcript-level.
# Salmon.basic.trans.Counts <- GetCounts_mRNA(SampleSheet$Salmon_basic_transcript_counts, SampleSheet$sample, "salmon")

## save TPM and Counts into CSV files
write.csv(Salmon.basic.Counts$abundance, file = paste0(outdir,"TPM.csv"))
write.csv(Salmon.basic.Counts$counts, file = paste0(outdir,"Counts.csv"))

Inspect the data

Distribution of genes with zero count across all samples. The genes that are not detected in all 24 samples can be filtered out before performing DESeq.

## row-wise (per gene) frequency of zeros
df <- data.frame(Salmon.basic.Counts$counts)
df$numZero <- rowSums(df == 0)
df$row.names <- rownames(df)

ggplot(df[,c('row.names', 'numZero')], aes(x=numZero)) + 
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3))) + 
  stat_bin(binwidth=1, center=0, geom="text", aes(label=after_stat(count)), vjust=-0.5, closed="left") + 
  labs(title = "Distribution of genes with zero counts", x = "Number of zero counts across samples", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5)) + Theme.YL

# Differential Expression Analysis ## The two conditions : CC-CR vs. CC “Cetuximab and panitumumab are epidermal growth factor receptor (EGFR) monoclonal antibodies (mAbs) that bind the extracellular domain of EGFR and enhance receptor internalization and degradation. These EGFR mAbs are common targeted agents for patients with wild-type KRAS metastatic colorectal cancer (CRC). … We generated a cell line from colonies with cystic morphology, which we designated cystic colonies (CC) and cetuximab-resistant CC (CC-CR). CC tumors were well differentiated and regressed following administration of cetuximab. CC-CR tumors were poorly differentiated and continued to grow in the presence of cetuximab, although not to the extent of untreated tumors.”

## Create colData
coldata <- SampleSheet[,c("sample","condition","bio_rep", "tech_rep")]

## factorize $condition with "CC" condition as reference
coldata$condition <- factor(coldata$condition)
coldata$condition <- relevel(coldata$condition, ref = "CC")
rownames(coldata) <- coldata$sample
coldata$sample <- NULL

## Reorder counts to the coldata
Salmon.basic.Counts.Ordered <- lapply(Salmon.basic.Counts[1:length(Salmon.basic.Counts)-1], function(x) data.frame(x) %>% dplyr::select(rownames(coldata)) %>% as.matrix())
Salmon.basic.Counts.Ordered$countsFromAbundance <- Salmon.basic.Counts$countsFromAbundance

## create DDS (DESeq Data Set)
dds.Salmon.basic <- DESeq2::DESeqDataSetFromTximport(txi = Salmon.basic.Counts.Ordered, 
                                                    colData = coldata,
                                                    design = ~condition)
dds.Salmon.basic
## class: DESeqDataSet 
## dim: 62754 24 
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(62754): ENSG00000278625.1 ENSG00000276017.1 ...
##   ENSG00000112062.12 ENSG00000258289.10
## rowData names(0):
## colnames(24): CC_rep1_s1 CC_rep1_s2 ... CC_CR_rep3_s3 CC_CR_rep3_s4
## colData names(3): condition bio_rep tech_rep

Pre-filtering

## filter out the genes with zero count across all samples
keep <- rowSums(counts(dds.Salmon.basic)) > 0
dds.Salmon.basic <- dds.Salmon.basic[keep,]
## (optional) filter out mitochondrial genes
# Create a mapping from Ensembl gene IDs to gene names so we can remove MT- genes
# gtf.path <- ""
gtf.data <- GenomicTools.fileHandler::importGTF(gtf.path, level="gene", features=c("gene_id", "gene_name", "gene_type"))
## Automatically detected number of rows to skip:  5 
## List of features in column 9:
## -----------------------------
## gene_id 
## gene_type 
## gene_name 
## level 
## tag
gene.map <- gtf.data[, c("gene_id", "gene_name")]

# Merge DESeqDataSet gene IDs with gene names from GTF
gene.map <- merge(data.frame(gene_id = rownames(dds.Salmon.basic)), 
                  gene.map, 
                  by = "gene_id", all.x = TRUE)

# Identify mitochondrial genes by gene name (adjust this according to your dataset)
mito.genes <- grep("^MT-", gene.map$gene_name, value = TRUE)

# Filter out mitochondrial genes from the DESeqDataSet
dds.Salmon.basic.filtered <- dds.Salmon.basic[!rownames(dds.Salmon.basic) %in% 
                                                gene.map$gene_id[gene.map$gene_name %in% mito.genes], ]
## (optional) pre-filtering based on biotypes (gene_type) from the gtf file
## https://www.gencodegenes.org/pages/biotypes.html
biotype.map <- gtf.data[, c("gene_id", "gene_type")]
unique(biotype.map$gene_type)
##  [1] "lncRNA"                             "transcribed_unprocessed_pseudogene"
##  [3] "unprocessed_pseudogene"             "miRNA"                             
##  [5] "protein_coding"                     "processed_pseudogene"              
##  [7] "snRNA"                              "transcribed_processed_pseudogene"  
##  [9] "misc_RNA"                           "TEC"                               
## [11] "transcribed_unitary_pseudogene"     "snoRNA"                            
## [13] "scaRNA"                             "rRNA_pseudogene"                   
## [15] "unitary_pseudogene"                 "pseudogene"                        
## [17] "rRNA"                               "IG_V_pseudogene"                   
## [19] "scRNA"                              "IG_V_gene"                         
## [21] "IG_C_gene"                          "IG_J_gene"                         
## [23] "sRNA"                               "ribozyme"                          
## [25] "translated_processed_pseudogene"    "vault_RNA"                         
## [27] "TR_C_gene"                          "TR_J_gene"                         
## [29] "TR_V_gene"                          "TR_V_pseudogene"                   
## [31] "TR_D_gene"                          "IG_C_pseudogene"                   
## [33] "TR_J_pseudogene"                    "IG_J_pseudogene"                   
## [35] "IG_D_gene"                          "IG_pseudogene"                     
## [37] "artifact"                           "Mt_tRNA"                           
## [39] "Mt_rRNA"

(optional) pre-filtering to keep only rows that have a count of at least 10 for a minimal number of samples (optional) pre-filtering lowly expressed genes log(tpm+1) > 0.5

## row-wise (per gene) frequency of zeros
df <- data.frame(counts(dds.Salmon.basic.filtered))
df$numZero <- rowSums(df == 0)
df$row.names <- rownames(df)

ggplot(df[,c('row.names', 'numZero')], aes(x=numZero)) + 
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3))) + 
  stat_bin(binwidth=1, center=0, geom="text", aes(label=after_stat(count)), vjust=-0.5, closed="left") + 
  labs(title = "Distribution of genes with zero counts", x = "Number of zero counts across samples", y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5)) + Theme.YL

Perform DEA & DESeq2 results

## perform DESeq
dds.Salmon.basic.filtered <- DESeq2::DESeq(dds.Salmon.basic.filtered)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 57 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
  • “CC” condition as control & FDR < 0.05.
  • cooksCutoff = TRUE (default): This resets p-values of outlier genes to NA, removing their influence on the results.
  • independentFiltering = TRUE (default): This filters out genes with low counts or high variability, helping to focus on more reliable result.
  • filterFun = ihw: This applies the ‘Independent hypothesis weighting’ method to adjust p-values for multiple comparisons, reducing false positives.
## DESeq2 results
FDR = 0.05 
res.Salmon.basic <- DESeq2::results(dds.Salmon.basic.filtered, 
                                    pAdjustMethod = "BH", alpha = FDR, 
                                    contrast = c("condition", "CC_CR", "CC"), 
                                    cooksCutoff = TRUE, 
                                    independentFiltering = TRUE, 
                                    filterFun=ihw) 

## Column names of the DESeq2 results
data.frame(mcols(res.Salmon.basic, use.names=T))
##                        type                                   description
## baseMean       intermediate     mean of normalized counts for all samples
## log2FoldChange      results log2 fold change (MLE): condition CC_CR vs CC
## lfcSE               results         standard error: condition CC CR vs CC
## stat                results         Wald statistic: condition CC CR vs CC
## pvalue              results      Wald test p-value: condition CC CR vs CC
## padj                results                 Weighted BH adjusted p-values
## weight              results                                   IHW weights
## Summary of the DESeq result
summary(res.Salmon.basic)
## 
## out of 34346 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 2545, 7.4%
## LFC < 0 (down)     : 2543, 7.4%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting
## How many adjusted p-values were less than 0.05?
sum(res.Salmon.basic$padj < FDR, na.rm=TRUE)
## [1] 5088
## Meta data
metadata(res.Salmon.basic)
## $alpha
## [1] 0.05
## 
## $ihwResult
## ihwResult object with 34378 hypothesis tests 
## Nominal FDR control level: 0.05 
## Split into 22 bins, based on an ordinal covariate
## 
## $lfcThreshold
## [1] 0

Add gene names to the result using the same gtf file

res.Salmon.basic$gene <- gtf.data$gene_name[match(rownames(res.Salmon.basic), gtf.data$gene_id)]
res.Salmon.basic$ensembl <- gsub('\\..+$', '', rownames(res.Salmon.basic))
res.Salmon.basic$entrez <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys=res.Salmon.basic$ensembl, column="ENTREZID", keytype="ENSEMBL", multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns

Data Visualization

Log fold change (LFC) shrinkage

Shrinkage of effect size (LFC estimates) is useful for visualization and ranking of genes. ‘apeglm’ and ‘ashr’ are preferred method over ‘normal’. Please read more about each method from the original papers (cited below) as well as from the Alternative shrinkage estimators and Extended section on shrinkage estimators. As for deciding between apeglm vs. ashr, see this thread and this thread.

## LFC shrinkage methods : type = c("apeglm", "ashr", "normal")
resLFC.Salmon.basic.apeglm <- DESeq2::lfcShrink(dds.Salmon.basic.filtered, res = res.Salmon.basic, type = "apeglm", 
                                                coef = "condition_CC_CR_vs_CC")
## Warning in nbinomGLM(x = x, Y = YNZ, size = size, weights = weightsNZ, offset =
## offsetNZ, : the line search routine failed, unable to sufficiently decrease the
## function value
resLFC.Salmon.basic.ashr <- DESeq2::lfcShrink(dds.Salmon.basic.filtered, res = res.Salmon.basic, type = "ashr")
resLFC.Salmon.basic.normal <- DESeq2::lfcShrink(dds.Salmon.basic.filtered, res = res.Salmon.basic, type = "normal",
                                                coef = "condition_CC_CR_vs_CC")

## reorder by adjusted p-value
resLFC.Salmon.basic.apeglm <- resLFC.Salmon.basic.apeglm[order(resLFC.Salmon.basic.apeglm$padj),]
resLFC.Salmon.basic.ashr <- resLFC.Salmon.basic.ashr[order(resLFC.Salmon.basic.ashr$padj),]
resLFC.Salmon.basic.normal <- resLFC.Salmon.basic.normal[order(resLFC.Salmon.basic.normal$padj),]
## MA plot before/after shrinkage
Plot.MA <- function(ma.data, baseMeanCutoff = 10, FDR = 0.05, title = "", colSig = "black") {

  ma <- data.frame(ma.data)
  ma <- ma %>% dplyr::mutate(isDE = case_when(
    abs(log2FoldChange) > 0.5 & baseMean >= baseMeanCutoff & padj < FDR ~ TRUE,
    TRUE ~ FALSE),
    isDE = as.factor(isDE))
  
  ma <- ma %>% dplyr::arrange(isDE)
  
  ma.plot <- ma %>% data.frame() %>% ggplot2::ggplot(aes(x=baseMean, y=log2FoldChange, color=isDE, shape=isDE, fill=isDE)) +
    geom_point(aes(shape = isDE, color=isDE), size = 3) +
    scale_x_continuous(trans = "log10") +
    scale_color_manual(values = c("grey80", colSig)) +
    scale_shape_manual(values= c(19,16)) +
    geom_vline(xintercept = 50, color = 'grey50', linetype = "dashed") +
    geom_hline(yintercept = c(-.5, .5), color = 'grey50', linetype = "dashed") +
    Theme.YL + ggtitle(label = title)
    
  plot(ma.plot)
  
}

Plot.MA(res.Salmon.basic, title = paste("MA plot (before shrinkage)", 
              "min:", round(range(res.Salmon.basic$log2FoldChange),2)[1], 
              "max:", round(range(res.Salmon.basic$log2FoldChange),2)[2]), 
        colSig = color$Extra[1])

Plot.MA(resLFC.Salmon.basic.apeglm, title = paste("MA plot (after apeglm shrinkage)", 
                                           "min:", round(range(resLFC.Salmon.basic.apeglm$log2FoldChange),2)[1], 
                                           "max:", round(range(resLFC.Salmon.basic.apeglm$log2FoldChange),2)[2]), 
        colSig = color$Extra[2])

Plot.MA(resLFC.Salmon.basic.ashr, title = paste("MA plot (after ashr shrinkage)", 
                                                "min:", round(range(resLFC.Salmon.basic.ashr$log2FoldChange),2)[1], 
                                                "max:", round(range(resLFC.Salmon.basic.ashr$log2FoldChange),2)[2]), 
        colSig = color$Extra[3])

Plot.MA(resLFC.Salmon.basic.normal, title = paste("MA plot (after normal shrinkage)", 
                                                  "min:", round(range(resLFC.Salmon.basic.normal$log2FoldChange),2)[1], 
                                                  "max:", round(range(resLFC.Salmon.basic.normal$log2FoldChange),2)[2]), 
        colSig = color$Extra[4])

Save DEA result

Save DEA result - Shrunk LFC

## choose one shrinkage method
resLFC.Salmon.basic <- resLFC.Salmon.basic.normal
rm(resLFC.Salmon.basic.apeglm, resLFC.Salmon.basic.ashr)

write.table(resLFC.Salmon.basic, file = paste0(outdir,"DEA_all_genes_shrunkLFC.tsv"), sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

IPA formatted file - after shrinkage of LFC

ipa <- resLFC.Salmon.basic[,c("gene","ensembl","entrez", "baseMean", "log2FoldChange","pvalue", "padj")]
write.table(ipa, file = paste0(outdir,"DEA_all_genes_IPA.txt"), sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

Volcano Plot

A volcano plot allows quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.

mean.cutoff = 10
lfc.cutoff = 0.5 
threshold <- resLFC.Salmon.basic$padj < FDR & 
  resLFC.Salmon.basic$baseMean > mean.cutoff & 
  abs(resLFC.Salmon.basic$log2FoldChange) > lfc.cutoff

df <- data.frame(resLFC.Salmon.basic) 
df$Significant <- threshold
df$genelabels <- rownames(df) %in% rownames(df)[df$Significant == TRUE][1:10]

p <- ggplot(df) +
  geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = Significant, alpha = Significant, size = Significant)) +
  scale_color_manual(values=color$Extra) +
  geom_label_repel(aes(x = log2FoldChange, y = -log10(padj), label = ifelse(genelabels == T, df$gene,""))) +
  labs(title = "Volcano Plot (CC_CR vs. CC)", 
       subtitle = paste0("10 significant genes are labelled. LFC cutoff = ", lfc.cutoff, 
                         ", p-value cutoff = ", FDR, ", baseMean cutoff = ", mean.cutoff)) + 
  xlab("log2 fold change") +  ylab("-log10 adjusted p-value") + 
  geom_vline(xintercept = c(lfc.cutoff, -lfc.cutoff),  colour = "grey50", linetype = 'dashed') + 
  geom_hline(yintercept = FDR, colour = "grey50", linetype = 'dashed') + 
  Theme.YL + scale_size_manual(values = c(1,2))
plot(p)

## Warning: Using alpha for a discrete variable is not advised.
## Warning: Use of `df$gene` is discouraged.
## ℹ Use `gene` instead.
## png 
##   2

PlotCounts

Check out counts for a gene of your interest. Normalized counts plus a pseudocount of 0.5 are shown by default.

gene.interest.name =  "MIR100HG" # high variability within CC_CR group 
gene.interest = rownames(resLFC.Salmon.basic[resLFC.Salmon.basic$gene == gene.interest.name,]) 

d <- plotCounts(dds.Salmon.basic.filtered, gene = gene.interest, intgroup = "condition", returnData = TRUE)
d$name <- rownames(d)
d$tech_rep <- SampleSheet$tech_rep

ggplot(d, aes(x=condition, y=count, color=tech_rep)) + scale_color_manual(values=color$Replicate) + scale_fill_manual(values=color$Condition) +
  geom_point(position=position_jitter(w=0.1,h=0)) + geom_boxplot(aes(group = condition, fill= condition),alpha = 0.3) + #geom_text_repel(aes(label = name)) +
  ggtitle(paste(gene.interest.name, "")) + Theme.YL

gene.interest.name =  "RPL5" # low variability within groups
gene.interest = rownames(resLFC.Salmon.basic[resLFC.Salmon.basic$gene == gene.interest.name,]) 

d <- plotCounts(dds.Salmon.basic.filtered, gene = gene.interest, intgroup = "condition", returnData = TRUE)
d$name <- rownames(d)
d$tech_rep <- SampleSheet$tech_rep

ggplot(d, aes(x=condition, y=count, color=tech_rep)) + scale_color_manual(values=color$Replicate) + scale_fill_manual(values=color$Condition) +
  geom_point(position=position_jitter(w=0.1,h=0)) + geom_boxplot(aes(group = condition, fill= condition),alpha = 0.3) + #geom_text_repel(aes(label = name)) +
  ggtitle(paste(gene.interest.name, "")) + Theme.YL

Clustering anlaysis

Data Transformation

For visualization or clustering, it might be useful to work with transformed versions of the count data. The regularized log (rlog) function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. The rlog transformation produces a similar variance stabilizing effect as Variance Stabilizing Transformation (vst), though rlog is more robust in the case when the size factors vary widely.

TransformCounts <- function(ddsMatrix, TransformationType = "rlog", blind = TRUE, fitType = "parametric") {
  ## user can choose between rlog and vst as their data transformation method
  ## check if the ddsMatrix contains already calculated dispersion values or not
    if (sum(is.na(SummarizedExperiment::rowData(ddsMatrix)[,"dispersion"])) == 0) {
      transformedCounts <- switch(
        TransformationType,
        "rlog" = DESeq2::rlog(ddsMatrix, fitType = fitType, blind = blind),
        "vst" = DESeq2::varianceStabilizingTransformation(ddsMatrix, fitType = fitType, blind = blind)) 
      } else {
        transformedCounts <- switch(
          TransformationType,
          "rlog" = DESeq2::rlog(estimateDispersions(estimateSizeFactors(ddsMatrix)), fitType = fitType, blind = blind), 
          "vst" = DESeq2::varianceStabilizingTransformation(estimateDispersions(estimateSizeFactors(ddsMatrix)), fitType = fitType, blind = blind))
          }
  hist(assay(ddsMatrix))
  hist(assay(transformedCounts))
  return(transformedCounts)
}
## all genes : N = 34378 
rlog.dds.Salmon.basic <- TransformCounts(dds.Salmon.basic.filtered, "rlog")

## Criteria : baseMean > 10 AND adjusted p-value < 0.05 AND |log2FoldChange| > 0.5
# replace NA values to FC:0 and padj:1, then select significant genes
res.Salmon.basic.Filter <- res.Salmon.basic %>% data.frame() %>%
  dplyr::mutate_at(vars(contains("log2FoldChange")), ~replace(., is.na(.), 0)) %>%
  dplyr::mutate_at(vars(contains("padj")), ~replace(., is.na(.), 1)) %>%
  dplyr::filter(baseMean > mean.cutoff, padj < FDR, abs(log2FoldChange) > lfc.cutoff)
res.Salmon.basic.Filter <- res.Salmon.basic.Filter[order(res.Salmon.basic.Filter$padj),]

## only the significant genes : N = 831 
rlog.dds.Salmon.basic.sig <- TransformCounts(dds.Salmon.basic.filtered[rownames(res.Salmon.basic.Filter),], "rlog") 

dds.Salmon.basic.filtered[rownames(res.Salmon.basic.Filter),]
## class: DESeqDataSet 
## dim: 1008 24 
## metadata(1): version
## assays(8): counts avgTxLength ... replaceCounts replaceCooks
## rownames(1008): ENSG00000122406.14 ENSG00000117523.18 ...
##   ENSG00000186567.14 ENSG00000289463.1
## rowData names(23): baseMean baseVar ... maxCooks replace
## colnames(24): CC_rep1_s1 CC_rep1_s2 ... CC_CR_rep3_s3 CC_CR_rep3_s4
## colData names(4): condition bio_rep tech_rep replaceable

Visualizing clusters with heatmap and dendogram

Using the count matrix with all genes, rlog transformation had been performed, and Euclidean distances among the samples are visualized as a heatmap and hierarchical clustering is performed using complete linkage method. Learn more about how to use pheatmap package: click here

## calculate distance 
sampleDists <- dist(t(assay(rlog.dds.Salmon.basic)), method = "euclidean")
sampleDistMatrix <- as.matrix(sampleDists)

## annotation color
mat_col <- data.frame(Condition = rlog.dds.Salmon.basic$condition, Replicate = rlog.dds.Salmon.basic$tech_rep)
rownames(mat_col) <- colnames(sampleDistMatrix)
mat_colors <- list(Condition = color$Condition, Replicate = color$Replicate)
names(mat_colors$Condition) <- unique(mat_col$Condition)
names(mat_colors$Replicate) <- unique(mat_col$Replicate)

## heatmap
p <- ggplotify::as.ggplot(pheatmap(mat = sampleDistMatrix,
         clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists,
         clustering_method = "complete",
         color = colorRampPalette(rev(brewer.pal(n = 9, name =  "YlGnBu")))(255),
         border_color = NA, cutree_cols=2, show_rownames = FALSE, fontsize = 15, #fontfamily = "Sans",
         width = ncol(sampleDistMatrix)*unit(7, "mm"), height = nrow(sampleDistMatrix)*unit(7, "mm"),
         cellwidth = 20, cellheight = 16,
         annotation_col = mat_col, annotation_colors = mat_colors, 
         main = "Clustering heatmap using all genes"))

## png 
##   2

Same as the plot above, but using the count matrix with only the significant genes.

## calculate distance 
sampleDists <- dist(t(assay(rlog.dds.Salmon.basic.sig)), method = "euclidean")
sampleDistMatrix <- as.matrix(sampleDists)

## annotation color
mat_col <- data.frame(Condition = rlog.dds.Salmon.basic.sig$condition, Replicate = rlog.dds.Salmon.basic.sig$tech_rep)
rownames(mat_col) <- colnames(sampleDistMatrix)
mat_colors <- list(Condition = color$Condition, Replicate = color$Replicate)
names(mat_colors$Condition) <- unique(mat_col$Condition)
names(mat_colors$Replicate) <- unique(mat_col$Replicate)


## heatmap
p <- ggplotify::as.ggplot(pheatmap(mat = sampleDistMatrix,
         clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists,
         clustering_method = "complete",
         color = colorRampPalette(rev(brewer.pal(n = 9, name =  "YlGnBu")))(255),
         border_color = NA, cutree_cols=2, show_rownames = FALSE, fontsize = 15, #fontfamily = "Sans",
         width = ncol(sampleDistMatrix)*unit(7, "mm"), height = nrow(sampleDistMatrix)*unit(7, "mm"),
         cellwidth = 20, cellheight = 16,
         annotation_col = mat_col, annotation_colors = mat_colors, 
         main = "Clustering heatmap using Significant genes"))

## png 
##   2

PCA and t-SNE plot

To make the large dataset of 34395 rows and 6 columns, dimensionality reduction techniques like PCA (Principal component analysis) and t-SNE (t-distributed stochastic neighbor embedding) are utilized for data visualization. We can utilize the clustering results as quality control measure of the experiment.

Recreation of Figure 2a

Scaling (z-transformation)

z_Transform <- function(rlog_dds) {
  normalized.counts = SummarizedExperiment::assay(rlog_dds)
  ## gene-wise z-transformation
  m = apply(normalized.counts, 1, mean)
  s = apply(normalized.counts, 1, sd)
  z = data.frame((normalized.counts - m)/s)
  ## add gene names
  # gtf.data <- GenomicTools.fileHandler::importGTF(gtf.path, level="gene", features=c("gene_id", "gene_name", "gene_type"))
  z$gene = gtf.data$gene_name[match(rownames(rlog_dds), gtf.data$gene_id)]
  z$ensembl <- gsub('\\..+$', '', rownames(rlog_dds))
  z$entrez <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db, keys=z$ensembl, column="ENTREZID", keytype="ENSEMBL", multiVals = "first")
  return(z)
}


z.counts <- z_Transform(rlog.dds.Salmon.basic)
## 'select()' returned 1:many mapping between keys and columns
write.csv(z.counts, file = paste0(outdir,"gene-wise-z-transformation.csv"))

Recreation of Figure 2a

Heatmap of the top 50 differentially expressed transcripts in CC-CR versus CC from three independent 3D culture experiments. Gene expression values are gene-wise z-transformed and are colored red for high abundance and blue for low abundance, as indicated in the scale bar.

## Top 50 genes from the original paper (fig 2a)
top50 = c("MIR100HG","HS3ST5","A2M","ZNF804A", "ARHGAP29", "SESN3", "SATB1", "SERPINE2", "AKAP12", "EPHB6", "GAS6", "IGFBP6", "CUEDC1", "CGN", "MESDC1", "VIL1", "SNTB1", "RP11-395P17.3", "ALPK3", "LOXL1", "CAPS", "FYN", "H19", "RASGRF1", "CDK6", "PKP1", "DYSF", "HLA-DRA", "ENC1", "QPCT", "IGF2BP2", "AC007952.5", "SPDYC", "HOXD10", "CIB2", "CFTR", "HOXD11", "TEX30", "LINC01018", "SERTAD4", "DKK3", 'GATA6', 'GSTM4', "CPVL", 'DKK1', 'EPB41L3', 'ZNK608', "CRABP1", "SERP2", "OLFM1") 

z.counts.top50 <- z.counts[z.counts$gene %in% top50,]
z.counts.top50 <- z.counts.top50 %>% arrange(factor(gene, levels = top50))
rownames(z.counts.top50) <- z.counts.top50$gene
z.counts.top50$gene <- NULL
z.counts.top50$ensembl <- NULL
z.counts.top50$entrez <- NULL

## Heatmap using gene-wise z-transformed gene expression values
mat_col <- data.frame(sample = rownames(coldata), 
                      Condition = coldata$condition)
rownames(mat_col) <- mat_col$sample
mat_col$sample <- NULL

p <- pheatmap(z.counts.top50, cluster_rows = FALSE,
         border_color = NA, fontsize = 15, 
         width = ncol(z.counts.top50)*unit(7, "mm"), height = nrow(z.counts.top50)*unit(7, "mm"),
         cellwidth = 20, cellheight = 16,
         annotation_col = mat_col, annotation_colors = mat_colors,
         main = "Top 50 differentially expressed genes in CC_CR vs. CC")

PlotCounts of MIR100HG gene

gene.interest.name = "MIR100HG"
gene.interest = rownames(resLFC.Salmon.basic[resLFC.Salmon.basic$gene == gene.interest.name,]) 

d <- plotCounts(dds.Salmon.basic.filtered, gene = gene.interest, intgroup = "condition", returnData = TRUE)
d$name <- rownames(d)
d$tech_rep <- SampleSheet$tech_rep

ggplot(d, aes(x=condition, y=count, color=tech_rep)) + scale_color_manual(values=color$Replicate) + scale_fill_manual(values=color$Condition) +
  geom_point(position=position_jitter(w=0.1,h=0)) + geom_boxplot(aes(group = condition, fill= condition),alpha = 0.3) + #geom_text_repel(aes(label = name)) +
  ggtitle(paste(gene.interest.name, "")) + Theme.YL

## png 
##   2

Functional Analysis

Gene set enrichment analysis (GSEA)

Preranked gene set enrichment analysis (GSEA) is a widely used method for interpretation of gene expression data in terms of biological processes. Utilizing (shrunken) log2 fold changes for all genes, see whether gene sets for particular biological pathways are enriched among the positive or negative fold changes.
MSigDB
Reactome Pathway Database

Ranking LFCs

## Remove NAs 
gseaDat <- resLFC.Salmon.basic[!is.na(resLFC.Salmon.basic$entrez),] 

## Remove duplicated gene names -- 1) significant? 2) highest baseMean
length(gseaDat$entrez[duplicated(gseaDat$entrez)])
## [1] 83
## 1) significant? 
gseaDat <- subset(gseaDat, abs(log2FoldChange) > lfc.cutoff & padj < FDR)

## No duplicates
length(gseaDat$entrez[duplicated(gseaDat$entrez)])
## [1] 0
## Rank all genes based on their shrunken fold change 
geneList <- gseaDat$log2FoldChange
names(geneList) <- gseaDat$entrez

## Plot the ranked fold changes in descending order
geneList <- sort(geneList, decreasing = T)
p <- barplot(geneList, main = "Ranked fold changes of all genes in descending order")

# pdf(paste0(outdir,"plot_rank_LFC.pdf"), width=10, height=8)
# plot(p)
# dev.off()

MSigDB & clusterProfiler

## Choose a category from MSigDB
MSigDB <- msigdbr::msigdbr(species = "Homo sapiens", category = "H")
MSigDB <- MSigDB %>% as.data.frame %>% dplyr::select(gs_name, entrez_gene)

set.seed(20250319)
gsea <- clusterProfiler::GSEA(geneList, TERM2GENE = MSigDB,
                                     minGSSize    = 10,
                                     maxGSSize    = 500,
                                     pvalueCutoff = 0.05,
                                     pAdjustMethod = "BH")
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
## significant gene set
gsea@result %>% dplyr::select(ID, NES, qvalue, setSize, rank) %>% dplyr::arrange(qvalue) %>% head()
##                                                                    ID      NES
## HALLMARK_INTERFERON_GAMMA_RESPONSE HALLMARK_INTERFERON_GAMMA_RESPONSE 2.683109
## HALLMARK_INTERFERON_ALPHA_RESPONSE HALLMARK_INTERFERON_ALPHA_RESPONSE 2.368676
##                                          qvalue setSize rank
## HALLMARK_INTERFERON_GAMMA_RESPONSE 0.0004538009      25  147
## HALLMARK_INTERFERON_ALPHA_RESPONSE 0.0017829376      19  335
openxlsx::write.xlsx(gsea@result %>% dplyr::arrange(qvalue), 
                     file = paste0(outdir, "/gsea.hallmark.xlsx"))

GeneRatio = count / setSize. ‘count’ is the number of genes that belong to a given gene-set, while ‘setSize’ is the total number of genes in the gene-set.

enrichplot::dotplot(gsea, showCategory = 10) + 
  ggplot2::scale_fill_gradient(low = "#132B43",  high = "#56B1F7", limits = c(0, 1), name = "qvalue") 
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

enrichplot::gseaplot(gsea, geneSetID = 1, title = gsea$Description[1])

enrichplot::gseaplot(gsea, geneSetID = 1, title = gsea$Description[2])

reactomePathways

## Pathway analysis
pathways <- fgsea::reactomePathways(names(geneList))  
fgseaRes <- fgsea::fgsea(pathways, geneList, maxSize = 500)

## Top 10 pathways
head(fgseaRes[order(padj, -abs(NES)), ], n=10)[,c(1,3,5)]
##                                          pathway      padj         ES
##                                           <char>     <num>      <num>
##  1:              Interferon alpha/beta signaling 0.3896538  0.5227749
##  2:  Antiviral mechanism by IFN-stimulated genes 0.3896538  0.5275529
##  3:                       Platelet degranulation 0.3896538  0.6107831
##  4: Response to elevated platelet cytosolic Ca2+ 0.3896538  0.6107831
##  5:                    ISG15 antiviral mechanism 0.3896538  0.5944044
##  6:                                 Gastrulation 0.3896538 -0.6505853
##  7:      Disorders of transmembrane transporters 0.3896538 -0.6820418
##  8:                  FOXO-mediated transcription 0.3896538  0.5555556
##  9:                         RHO GTPase Effectors 0.3896538 -0.6323468
## 10:             Crosslinking of collagen fibrils 0.3896538 -0.7974026
## Enrichment plot of the first pathway
pathway.interest = head(fgseaRes[order(padj, -abs(NES)), ], n=10)$pathway[1]
p <- fgsea::plotEnrichment(pathways[[pathway.interest]], geneList) + ggplot2::labs(title=pathway.interest) + Theme.YL

# pdf(paste0(outdir,"plot_pathway_", gsub('\\s|\\/', '_', pathway.interest), ".pdf"), width=10, height=8)
plot(p)

# dev.off()

GO enrichment analysis

Gene Ontology (GO) analysis is used to highlight biological processes. From the GOseq paper : “One simple, but extremely widely used, systems biology technique for highlighting biological processes is gene category over-representation analysis. In order to perform this analysis, genes are grouped into categories by some common biological property and then tested to find categories that are over represented amongst the differentially expressed genes. Gene Ontology (GO) categories are commonly used in this technique and there are many tools available for performing GO analysis. … In order to correct for selection bias in category testing, we propose the following three-step methodology. First, the genes that are significantly DE between conditions are identified. The GOseq method works with any procedure for identifying DE genes. Second, the likelihood of DE as a function of transcript length is quantified. This is obtained by fitting a monotonic function to DE versus transcript length data. Finally, the DE versus length function is incorporated into the statistical test of each category’s significance. This final step takes into account the lengths of the genes that make up each category.”

## Create a list of differentially expressed genes (1=DE, 0=Not DE)
genes <- as.integer(threshold)
names(genes) <- resLFC.Salmon.basic$ensembl
table(genes)
## genes
##     0     1 
## 33702   676
## Fit the Probability Weighting Function (PWF)
gene_lengths <- SummarizedExperiment::assays(dds.Salmon.basic.filtered)[["avgTxLength"]]
gene_lengths$median <- apply(gene_lengths, 1, median)  
pwf <- goseq::nullp(genes, "hg38", "ensGene", bias.data = gene_lengths$median)

## Conduct GO enrichment analysis
goResults <- goseq::goseq(pwf, "hg38", "ensGene", test.cats=c("GO:BP"))
## Plot the top 10
GO_top10 <- goResults %>%
            top_n(10, wt=-over_represented_pvalue) %>%
            mutate(hitsPerc=numDEInCat*100/numInCat) %>%
            ggplot(aes(x=hitsPerc,
                      y=term,
                      colour=over_represented_pvalue,
                      size=numDEInCat)) +
            geom_point() +
            expand_limits(x=0) +
            labs(x="Hits (%)", y="GO term", colour="p value", size="Count", title = "GO enrichment (Top 10)") + Theme.YL

# pdf(paste0(outdir,"plot_GO_top10.pdf"), width=12, height=10)
plot(GO_top10)

# dev.off()

Further Analysis

Qiagen Ingenuity Pathway Analysis (IPA) :
Should you wish to perform the IPA analysis, contact the admin of the software to create your account. You will get an email with activation instruction upon approval. IPA appropriately formatted file can be found at : [working_directory]/[project_title]/[run_name]/DEAoutput/DEA_all_genes_IPA.txt
Please watch the following videos before using the software.
* Data Formatting in IPA : https://tv.qiagenbioinformatics.com/video/50301639/data-formatting-in-ipa
* Data Upload in IPA : https://tv.qiagenbioinformatics.com/video/50301463/data-upload-in-ipa
* Run Core Analysis & Interpret Results : https://tv.qiagenbioinformatics.com/video/64233123/making-your-rnaseq-microarray

Fusion gene analysis using Arriba :
Fusion genes can be detected by utilizing intermediate files of STAR run as input files for Arriba. Upon completion of the RNA-seq pipeline, the files can be found at [working_directory]/[project_title]/[run_name]/result/star_salmon/[sample_name].Aligned.out.bam When you install Arriba using Bioconda, Blacklist, Known fusions, and Protein domains of GRCh38 (and others) are included in your conda environment. The files can be found at [path/to/your/conda/environment]/var/lib/arriba/blacklist_hg38_GRCh38_v2.4.0.tsv.gz

Literature

Literature :
* STAR for read alignment of RNA-seq data:
Alexander Dobin and others, STAR: ultrafast universal RNA-seq aligner, Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635
https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
* Salmon for transcript quantification from RNA-seq data:
Patro, R., Duggal, G., Love, M. et al. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14, 417–419 (2017). https://doi.org/10.1038/nmeth.4197
https://combine-lab.github.io/salmon/
* DESeq2 for differential expression analysis based on the negative binomial distribution:
Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014). https://doi.org/10.1186/s13059-014-0550-8
http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#standard-workflow
* DESeq2 shrinkage method ‘ashr’:
Matthew Stephens, False discovery rates: a new deal, Biostatistics, Volume 18, Issue 2, April 2017, Pages 275–294, https://doi.org/10.1093/biostatistics/kxw041
* DESeq2 shrinkage method ‘apeglm’:
Anqi Zhu, Joseph G Ibrahim, Michael I Love, Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences, Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2084–2092, https://doi.org/10.1093/bioinformatics/bty895
* GO enrichment analysis :
Young, M.D., Wakefield, M.J., Smyth, G.K. et al. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol 11, R14 (2010). https://doi.org/10.1186/gb-2010-11-2-r14
* Pathway analysis :
Fast gene set enrichment analysis, Gennady Korotkevich, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N. Artyomov, Alexey Sergushichev. doi: https://doi.org/10.1101/060012
* Pathway database :
Marc Gillespie and others, The reactome pathway knowledgebase 2022, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D687–D692, https://doi.org/10.1093/nar/gkab1028
* Example mRNA-seq data :
Lu, Y., Zhao, X., Liu, Q. et al. lncRNA MIR100HG-derived miR-100 and miR-125b mediate cetuximab resistance via Wnt/β-catenin signaling. Nat Med 23, 1331–1341 (2017). https://doi.org/10.1038/nm.4424
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE82236
* Qiagen Ingenuity Pathway Analysis (IPA) :
Bioinformatics, Volume 30, Issue 4, February 2014, Pages 523–530, https://doi.org/10.1093/bioinformatics/btt703
* Arriba for fusion gene analysis :
Uhrig S, Ellermann J, Walther T, et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Res. 2021;31(3):448-460. doi:10.1101/gr.257246.119

Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
##  [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
##  [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
##  [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
##  [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
## [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
## 
## time zone: Europe/Amsterdam
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DT_0.33                          org.Hs.eg.db_3.20.0             
##  [3] AnnotationDbi_1.68.0             GenomicTools.fileHandler_0.1.5.9
##  [5] corrplot_0.95                    data.table_1.16.4               
##  [7] fgsea_1.32.2                     clusterProfiler_4.12.6          
##  [9] msigdbr_7.5.1                    stringr_1.5.1                   
## [11] goseq_1.58.0                     geneLenDataBase_1.42.0          
## [13] BiasedUrn_2.0.12                 ggplotify_0.1.2                 
## [15] plotly_4.10.4                    ggrepel_0.9.6                   
## [17] plyr_1.8.9                       tidyr_1.3.1                     
## [19] tximport_1.34.0                  Rtsne_0.17                      
## [21] seriation_1.5.7                  dendextend_1.19.0               
## [23] ggvenn_0.1.10                    gridExtra_2.3                   
## [25] RColorBrewer_1.1-3               pheatmap_1.0.12                 
## [27] xlsx_0.6.5                       ashr_2.2-63                     
## [29] apeglm_1.28.0                    IHW_1.34.0                      
## [31] DESeq2_1.46.0                    SummarizedExperiment_1.36.0     
## [33] Biobase_2.66.0                   MatrixGenerics_1.18.1           
## [35] matrixStats_1.5.0                GenomicRanges_1.58.0            
## [37] GenomeInfoDb_1.42.3              IRanges_2.40.1                  
## [39] S4Vectors_0.44.0                 BiocGenerics_0.52.0             
## [41] ggfortify_0.4.17                 ggplot2_3.5.1                   
## [43] dplyr_1.1.4                      pacman_0.5.1                    
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.2            BiocIO_1.16.0            bitops_1.0-9            
##   [4] filelock_1.0.3           R.oo_1.27.0              tibble_3.2.1            
##   [7] lpsymphony_1.34.0        XML_3.99-0.18            lifecycle_1.0.4         
##  [10] httr2_1.1.0              mixsqp_0.3-54            rprojroot_2.0.4         
##  [13] vroom_1.6.5              lattice_0.22-6           MASS_7.3-64             
##  [16] magrittr_2.0.3           openxlsx_4.2.8           sass_0.4.9              
##  [19] rmarkdown_2.29           jquerylib_0.1.4          yaml_2.3.10             
##  [22] ggtangle_0.0.6           zip_2.3.2                cowplot_1.1.3           
##  [25] DBI_1.2.3                abind_1.4-8              zlibbioc_1.52.0         
##  [28] R.utils_2.12.3           purrr_1.0.4              RCurl_1.98-1.16         
##  [31] yulab.utils_0.2.0        xlsxjars_0.6.1           rappdirs_0.3.3          
##  [34] GenomeInfoDbData_1.2.13  enrichplot_1.26.6        irlba_2.3.5.1           
##  [37] tidytree_0.4.6           reactome.db_1.89.0       codetools_0.2-20        
##  [40] DelayedArray_0.32.0      DOSE_4.0.0               xml2_1.3.6              
##  [43] tidyselect_1.2.1         aplot_0.2.4              farver_2.1.2            
##  [46] UCSC.utils_1.2.0         viridis_0.6.5            TSP_1.2-4               
##  [49] BiocFileCache_2.14.0     rmdformats_1.0.4         GenomicAlignments_1.42.0
##  [52] jsonlite_1.8.9           survival_3.8-3           iterators_1.0.14        
##  [55] bbmle_1.0.25.1           foreach_1.5.2            tools_4.4.2             
##  [58] progress_1.2.3           treeio_1.30.0            Rcpp_1.0.14             
##  [61] glue_1.8.0               SparseArray_1.6.1        here_1.0.1              
##  [64] xfun_0.50                mgcv_1.9-1               qvalue_2.38.0           
##  [67] ca_0.71.1                withr_3.0.2              numDeriv_2016.8-1.1     
##  [70] fastmap_1.2.0            digest_0.6.37            truncnorm_1.0-9         
##  [73] R6_2.6.1                 gridGraphics_0.5-1       colorspace_2.1-1        
##  [76] GO.db_3.20.0             biomaRt_2.62.1           RSQLite_2.3.9           
##  [79] R.methodsS3_1.8.2        generics_0.1.3           rtracklayer_1.66.0      
##  [82] prettyunits_1.2.0        httr_1.4.7               htmlwidgets_1.6.4       
##  [85] S4Arrays_1.6.0           pkgconfig_2.0.3          rJava_1.0-11            
##  [88] gtable_0.3.6             blob_1.2.4               registry_0.5-1          
##  [91] XVector_0.46.0           htmltools_0.5.8.1        bookdown_0.42           
##  [94] scales_1.3.0             png_0.1-8                ggfun_0.1.8             
##  [97] knitr_1.49               rstudioapi_0.17.1        tzdb_0.4.0              
## [100] reshape2_1.4.4           rjson_0.2.23             coda_0.19-4.1           
## [103] nlme_3.1-167             curl_6.2.0               bdsmatrix_1.3-7         
## [106] cachem_1.1.0             parallel_4.4.2           restfulr_0.0.15         
## [109] pillar_1.10.1            vctrs_0.6.5              slam_0.1-55             
## [112] dbplyr_2.5.0             evaluate_1.0.3           readr_2.1.5             
## [115] invgamma_1.1             GenomicFeatures_1.58.0   mvtnorm_1.3-3           
## [118] cli_3.6.4                locfit_1.5-9.11          compiler_4.4.2          
## [121] Rsamtools_2.22.0         rlang_1.1.5              crayon_1.5.3            
## [124] SQUAREM_2021.1           labeling_0.4.3           fdrtool_1.2.18          
## [127] emdbook_1.3.13           fs_1.6.5                 stringi_1.8.4           
## [130] viridisLite_0.4.2        BiocParallel_1.40.0      babelgene_22.9          
## [133] txdbmaker_1.2.1          munsell_0.5.1            Biostrings_2.74.1       
## [136] lazyeval_0.2.2           GOSemSim_2.32.0          Matrix_1.7-2            
## [139] patchwork_1.3.0          hms_1.1.3                bit64_4.6.0-1           
## [142] KEGGREST_1.46.0          igraph_2.1.4             memoise_2.0.1           
## [145] snpStats_1.56.0          bslib_0.9.0              ggtree_3.14.0           
## [148] fastmatch_1.1-6          bit_4.5.0.1              gson_0.1.0              
## [151] ape_5.8-1